Abstract

Search engine companies collect the "database of intentions",
the histories of their users' search queries. These search logs
are a gold mine for researchers. Search engine companies, however,
are wary of publishing search logs in order not to disclose
sensitive information.

In this paper we analyze algorithms to publish frequent keywords,
queries and clicks of a search log. How do their formal disclosure
limitation guarantees compare, and how easily can they be attacked
in practice? How much utility can they provide in theory, and how
useful are they for real world applications?

We conduct a thorough comparison that includes both theoretical
results and an experimental evaluation. In particular, we show how
proposals to achieve anonymity [1, 15, 27] can be attacked. The
stronger guarantee of ε-differential privacy unfortunately does not
provide any utility. We consider two relaxations. While an algorithm
has previously been proposed for one of them [20], we show how to
guarantee the other, strictly stronger relaxation. An extensive
experimental evaluation compares the utility of the algorithms for
two applications targeted at search quality and search efficiency.
We find that our proposal to achieve a relaxation of differential
privacy yields utility comparable to that of a proposal that achieves
anonymity while at the same time offering a much stronger guarantee.

1 Introduction

Civilization is the progress toward a society of pri- 
vacy. The savage's whole existence is public, ruled 
by the laws of his tribe. Civilization is the process of 
setting man free from men. — Ayn Rand. 

My favorite thing about the Internet is that you 
get to go into the private world of real creeps without 
having to smell them. — Penn Jillette. 

It is hard to imagine the Web today without the
easy access to search engines that help us find any
information. Whenever a user submits a search query, the
search engine logs the query and other information
associated with it (for example, what links the user
clicked on). The contents of these search logs enable
much valuable research. They can be used for finding
trends, patterns, and anomalies in the search behavior
of users. They can also be used to develop and test
new algorithms to improve search performance. Today
such research is mainly conducted within search
engine companies. Search logs are not being released
to the public or researchers outside these companies
because of privacy concerns.

The concerns about releasing search logs came out
clearly in the AOL debacle of 2006, when AOL published
three months of search logs of 650,000 users.
The only privacy protection was the replacement of
user-ids with random numbers. However, the queries
of a user usually contain identifying information such
as searches for addresses and local events or even vanity
searches for the user's name. Such information
can be linked to external databases to re-identify the
user. This is what the New York Times did to track
down a user [5]. Through her queries they were able
to identify Ms. Thelma Arnold from Lilburn, Georgia.
Her queries contained not only enough identifying
information but also sensitive information about her
friends' medical ailments. In general, users are concerned
that the release of a search log reveals their
life and personality through their searches for diseases,
habits, lifestyle choices, personal tastes, and
political affiliations.

The AOL debacle shows that the ad-hoc method
of replacing user-ids with random numbers does not
prevent such information disclosure.¹ Other ad-hoc
methods have been studied and found to be insufficient,
such as the removal of names, age, zip codes
and other identifiers [16] and the replacement of keywords
in search queries by random numbers [21].

In this paper, we investigate principled ways to 
control the disclosure when publishing search logs. 
We conduct a comparison of various methods for lim- 
iting disclosure and answer the following questions: 
What formal guarantees of limiting disclosure are de- 
sirable and appropriately address the privacy con- 
cerns? How can these guarantees be achieved and 
how much useful information can be preserved? 

First we examine k-anonymity [32]. We find that
proposals to achieve k-anonymity in search logs [1, 27, 15]
are insufficient in the light of attackers who can
actively influence the search log by creating accounts
and submitting queries themselves.

Moreover, we illustrate with an example that k- 
anonymity does not prevent an attacker from learning 
that a particular user submitted a particular query 
(see Section [3]). 

Thus, we have to turn to a stronger privacy defi- 
nition such as differential privacy [12] . It has been 
successfully applied to publish contingency tables [S] 
and to solve learning problems [3 [18] . In the context 
of search logs it limits what an attacker can learn 
from the published search log (even if the attacker 
has multiple accounts and submitted queries herself). 

¹ http://en.wikipedia.org/wiki/AOL_search_data_scandal
has more information, including links to the resignation of
AOL's CTO and the ongoing class action lawsuit against AOL
resulting from the data release.



Unfortunately, we can show that it is impossible to 
achieve good utility for publishing frequent keywords, 
queries, etc. under this definition (see Section 4).

To overcome this impossibility result we have to relax
ε-differential privacy. We consider two probabilistic
relaxations: (ε, δ)-probabilistic differential privacy [24]
and (ε, δ)-indistinguishability [11]. A simple
algorithm has been developed independently by Korolova
et al. [20] and us [13]. This algorithm was
shown to guarantee (ε, δ)-indistinguishability [20].
We offer a new analysis of the parameter settings under
which it also guarantees (ε, δ)-probabilistic differential
privacy [24] in Section 5.2. This is interesting
because probabilistic differential privacy is a strictly
stronger guarantee (see Section 6 for a comparison of
the two relaxations).

Finally, we compare the utility of anonymity and 
privacy in an extensive experimental evaluation in
Section 8. In our application-oriented approach we
implement two search log applications to improve 
both search experience and search performance, run 
these applications on the original, the anonymity- 
preserving and the privacy-preserving search log, and 
compare the results using application-specific met- 
rics. 

We believe that our findings are of interest for other 
problems than publishing search logs. Indeed instead 
of publishing frequent keywords and queries we can 
apply the findings to publish multi-basket data or 
frequent items/itemsets in general. 

2 Preliminaries 

In this section, we introduce our model of a search 
log. We review guarantees that limit the disclosure 
of a search log and utility measures. This is an
introduction to the problem of publishing frequent
keywords, queries, clicks, etc. of search logs while at the
same time limiting the disclosure.

2.1 Search Logs 

Search engines such as Bing, Google, or Yahoo enable 
users to ask keyword queries and return a ranked list 
of relevant websites. Users then click on one or more
of these links. Search engines have sophisticated ways
to identify users; for example, users may be logged on
to accounts provided by the search engine, or they
might be identified via cookies and IP addresses. A
search log is a collection of search log entries that
contain data about the users' queries and the links
that they clicked on. We assume that a search log
entry has the following schema:

(user-id, query, time, clicks),

where a user-id identifies a user, a query is a set
of keywords, and clicks is a list of urls indicating
that the user clicked on the given link (url).
A user history consists of all search entries from
a single user. Such a history is usually partitioned
into sessions containing queries with similar user
intent; many details go into this partitioning that
are orthogonal to all the techniques in this paper.
Query pairs are two subsequent queries from the
same user that are contained in the same session.

Search engines compute histograms, or counts of
keywords, queries, etc., from the search log, and use
these instead of the actual log for a variety of applications.
A keyword histogram of a search log S
records for each keyword k the number of users c_k
whose search history in S contains that keyword.
A keyword histogram is thus a set of pairs (k, c_k).
We define the query histogram, the query pair
histogram, and the click histogram analogously.
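
As an illustration of the histogram definition, the following minimal Python sketch counts, for each keyword, the number of distinct users whose history contains it. The tuple layout mirrors the schema (user-id, query, time, clicks); the function name `keyword_histogram` and the toy log are assumptions made for this example, not part of the original system.

```python
from collections import defaultdict

def keyword_histogram(search_log):
    """Return {keyword k: number of distinct users c_k whose history contains k}."""
    users_per_keyword = defaultdict(set)
    for user_id, query, _time, _clicks in search_log:
        for keyword in query:
            users_per_keyword[keyword].add(user_id)
    return {k: len(users) for k, users in users_per_keyword.items()}

# Toy log (hypothetical data): entries follow (user-id, query, time, clicks).
log = [
    ("u1", {"flu", "symptoms"}, 1, ["a.example"]),
    ("u1", {"flu", "shot"}, 2, []),
    ("u2", {"flu"}, 3, ["b.example"]),
]
hist = keyword_histogram(log)
print(hist)
```

Note that user u1 is counted only once for the keyword "flu" even though it occurs in two of her queries, because the histogram counts users rather than occurrences.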

2.2 Disclosure Limitations for Publishing Search Logs

This section defines various disclosure limitation
guarantees that have been proposed in the literature.
We start with k-anonymity, which prevents
re-identifying the data of a user in the published data.

Definition 1 (k-anonymity [32]). A search log is
k-anonymous if the search history of every individual
is indistinguishable from the history of at least k − 1
other individuals in the published search log.

Variants that require indistinguishability not at the
level of whole search histories but at the level of sessions
[27] or queries [1] have been proposed.



Next we continue with stronger privacy definitions.
In the past couple of years two different privacy definitions
have received a lot of attention. The first privacy
definition limits the information an attacker
can infer about the queries of a user. Dwork [10]
showed that it is impossible to provide any utility if
this guarantee is to be given for all possible adversaries.
Thus people have studied limited classes of
adversaries [221 [221 EH EH [33] . We believe that none 
of these assumptions hold in practice in the case of 
search logs and it would be insufficient to protect 
against only these classes. A suitable class has yet to 
be proposed and analyzed. 

In the remainder of the paper we focus on differen- 
tial privacy [12] . It guarantees that an attacker learns 
roughly the same information about a user whether 
or not the search history of that user was included in 
the published search log. With this guarantee a user 
does not regret having used the particular search en- 
gine for his or her queries. Dwork et al. turned this 
idea into a formal privacy definition that we apply 
here to search logs: 

Definition 2 [12]. An algorithm A is ε-differentially
private if for all search logs S and S' differing in
the search history of a single user and for all output
search logs O:

\[ \Pr[A(S) = O] \le e^{\epsilon} \Pr[A(S') = O]. \]

This definition ensures that a user has no reason to
complain that the search engine published O, since O
could have also arisen from a search log S' in which
the search history of that user was replaced by arbitrary
queries and clicks. We will refer to search logs
that only differ in the search history of a single user
as neighboring search logs in the remainder of this
paper.

We will also discuss two relaxations of differential
privacy. A probabilistic version called
(ε, δ)-probabilistic differential privacy that
relaxes ε-differential privacy has been proposed by
Machanavajjhala et al.:

Definition 3 [24]. An algorithm A satisfies (ε, δ)-probabilistic
differential privacy if for all search logs
S we can divide the output space Ω into two sets Ω_1, Ω_2
such that

\[ (1) \quad \Pr[A(S) \in \Omega_2] \le \delta, \]

and for all neighboring search logs S' and for all O ∈ Ω_1:

\[ (2) \quad \Pr[A(S) = O] \le e^{\epsilon} \Pr[A(S') = O] \quad \text{and} \quad \Pr[A(S') = O] \le e^{\epsilon} \Pr[A(S) = O]. \]

This definition guarantees that algorithm A
achieves ε-differential privacy with high probability
(≥ 1 − δ). The set Ω_2 contains all outputs
that are considered a privacy breach according to
ε-differential privacy. The probability of such an output
is bounded by δ.

Another relaxation has been proposed by Dwork et
al.:

Definition 4 [11]. An algorithm A is (ε, δ)-indistinguishable
if for all search logs S, S' differing
in one user history and for all subsets O of the output
space Ω:

\[ \Pr[A(S) \in O] \le e^{\epsilon} \Pr[A(S') \in O] + \delta. \]

We will compare these two definitions in Section 6.
In particular, we will show that probabilistic differential
privacy implies indistinguishability. On the other hand, the
converse does not hold, and we show that there exists
an algorithm that is (ε', δ')-indistinguishable yet blatantly
non-private (in the sense of both ε-differential
privacy and (ε, δ)-probabilistic differential privacy).

2.3 Utility Measures 

In this paper, we compare the utility of algorithms both
theoretically and practically, the latter by running applications
on the sanitized search logs and comparing
the results with application-specific metrics.

2.3.1 Formal Utility Measure of Frequent 
Items 

In the context of search logs, we think of keywords,
queries, consecutive queries, and clicks as items. We propose
a utility measure for algorithms that publish frequent
items. Consider a discrete domain of items D.
Each user contributes a set of these items, recorded in
a database S. Suppose we are interested in the frequent
items. For simplicity, forget about the particular
counts and assume we want to publish all items
that occur at least τ times in the database. We denote
by f_d(S) the frequency of item d in S.

We measure the inaccuracy of an algorithm as the
expected number of items it gets wrong, i.e., items it either
includes in the output despite the fact that they are
infrequent or does not include in the output
despite the fact that they are frequent. We do not
expect the algorithm to be perfect. It may make mistakes
for items with frequency very close to τ, which
we neglect. The parameter ξ defines what closeness
means. From now on we will refer to the items with
frequency > τ + ξ as the very frequent items and
the items with frequency < τ − ξ as the very infrequent
items. We will measure the inaccuracy of
an algorithm only by its inability to retain the very
frequent items and to filter out the very infrequent
items. We exclude items with frequencies within τ ± ξ
from our utility measure and thus allow algorithms
to make mistakes on these close cases.

Definition 5. For algorithm A on input S the
(A, S)-inaccuracy with slack ξ is defined as

\[ E\bigl[\,\bigl|\{d \in A(S) \mid f_d(S) < \tau - \xi\} \cup \{d \notin A(S) \mid f_d(S) > \tau + \xi\}\bigr|\,\bigr]. \]

The expectation is taken over the randomness of
the algorithm. As a baseline consider the simple algorithm
that always outputs an empty set. On input
S it has an inaccuracy equal to the number of items
with frequency greater than τ + ξ.

For the results in the next sections it will be useful
to distinguish the error of an algorithm on the very
frequent items and the error on the very infrequent
items. We can rewrite the inaccuracy as:

\[ \sum_{d:\, f_d(S) > \tau + \xi} \bigl(1 - \Pr[d \in A(S)]\bigr) \;+\; \sum_{d \in D:\, f_d(S) < \tau - \xi} \Pr[d \in A(S)]. \]

Thus, the (A, S)-inaccuracy with slack ξ can be
rewritten as the sum of the inability to retain the very
frequent items plus the inability to filter out the very
infrequent items. For example, the baseline algorithm
has an inaccuracy on the very infrequent items of 0
and an inaccuracy on the very frequent items equal
to the number of very frequent items.
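
The inaccuracy measure can be made concrete with a small Python sketch. It evaluates the measure for one fixed output set; for a randomized algorithm the definition takes the expectation of this quantity over the algorithm's randomness. The function name `inaccuracy` and the toy frequencies are assumptions made for this example.

```python
def inaccuracy(output, freq, tau, xi):
    """Number of very infrequent items published plus very frequent items dropped.

    output: set of published items; freq: {item d: f_d(S)}.
    Items with frequency within tau +/- xi are ignored (the slack region).
    """
    false_positives = sum(1 for d in output if freq.get(d, 0) < tau - xi)
    false_negatives = sum(
        1 for d, f in freq.items() if f > tau + xi and d not in output
    )
    return false_positives + false_negatives

# Hypothetical frequencies: "a" is very frequent, "d" is very infrequent,
# "b" and "c" fall inside the slack region for tau = 50, xi = 5.
freq = {"a": 100, "b": 52, "c": 50, "d": 3}
tau, xi = 50, 5

baseline = inaccuracy(set(), freq, tau, xi)       # empty-set baseline
mixed = inaccuracy({"c", "d"}, freq, tau, xi)     # drops "a", publishes "d"
print(baseline, mixed)
```

The baseline errs only on the single very frequent item, while the second output is penalized once for dropping "a" and once for publishing "d"; publishing "c" costs nothing because it lies within the slack.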

Definition 6. We say that an algorithm A
is c-accurate for the very frequent items if there is
some c > 0 such that for any input database with
a very frequent item the probability of outputting this
item is at least c.

2.3.2 Experimental Utility Measures 

Traditionally, the utility of a privacy-preserving algo- 
rithm has been evaluated by comparing the input of 
the algorithm with the output to see "how much in- 
formation is lost" by comparing some statistics of the 
sanitized output with the original data. The choice 
of suitable statistics is a difficult problem as these 
statistics need to mirror the sufficient statistics of
applications that will use the sanitized search log.
As notable exceptions, Brickell et al. [5] measure the
utility with respect to data mining tasks and Kifer 
and Gehrke [TH] develop specific techniques to boost 
utility with respect to log-linear models. We picked 
two real applications from the information retrieval 
community to compare the utility of different algo- 
rithms: Index caching as a representative application 
for search performance, and query substitution as a 
representative application for search quality. Formal 
definitions can be found in Section 8.

In Sections 3 to 5 we will analyze various
algorithms in terms of their disclosure limitation
guarantee and their theoretical utility. Then, in Section 8,
we will compare the utility with respect to real
world applications.

3 Algorithm — Anonymity 

Let us first discuss algorithms that have been suggested
to achieve different types of k-anonymity
in search logs. Adar proposes the following algorithm:
Given a search log partitioned into sessions,
all queries are discarded that are associated with
fewer than k different user-ids. In each session the
user-id is substituted by a random number [1]. We
call the output a k-query anonymous search log. Motwani
and Nabar substitute in each session the user-id
by a random number and then add or delete keywords
from sessions until each session contains the
same keywords as at least k − 1 other sessions in
the search log [27]. We call the output a k-session
anonymous search log. He and Naughton generalize
keywords by taking their prefix until each keyword is
part of at least k search histories [15]. A histogram of
the partially generalized keywords and their counts
is then published. We call the output a k-keyword
anonymous search log.
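
The first of these proposals, k-query anonymity, can be sketched in a few lines of Python. This is only a rough illustration of the description above, not Adar's actual implementation: the session representation, the function name `k_query_anonymize` and the use of 32-bit random numbers as replacement ids are assumptions.

```python
import random
from collections import defaultdict

def k_query_anonymize(sessions, k, seed=0):
    """sessions: list of (user_id, [query, ...]) pairs, one per session.

    Discards every query associated with fewer than k distinct user-ids and
    replaces the user-id of each session by a random number.
    """
    users_per_query = defaultdict(set)
    for user_id, queries in sessions:
        for q in queries:
            users_per_query[q].add(user_id)

    rng = random.Random(seed)
    anonymized = []
    for _user_id, queries in sessions:
        kept = [q for q in queries if len(users_per_query[q]) >= k]
        anonymized.append((rng.getrandbits(32), kept))
    return anonymized

# Hypothetical sessions: "flu" is asked by three users, "rare" by one.
sessions = [("u1", ["flu", "rare"]), ("u2", ["flu"]), ("u3", ["flu"])]
result = k_query_anonymize(sessions, k=3)
print([queries for _id, queries in result])
```

With k = 3 the query "rare" is dropped because only one user-id issued it, while "flu" survives in all three sessions.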

3.1 Utility Analysis 

Recall that in Definition 5 we define the inaccuracy of
an algorithm as the expected size of the symmetric
set difference between the set of very frequent keywords
and the output of the algorithm. We observe
that a k-query anonymous search log provides perfect
utility for frequent queries whenever k ≤ τ + ξ.
It might not provide perfect utility for frequent keywords,
since a frequent keyword that shows up in
many infrequent queries might not be contained in
the output. In contrast, a k-keyword anonymous search
log provides perfect utility for frequent keywords, but
it has no utility for frequent queries or clicks. Only
k-session anonymity provides utility for clicks. However,
since sessions are altered through addition and
deletion of keywords, its precise utility for frequent
keywords, queries or clicks depends on the input data.

Overall, k-anonymity provides extremely good utility
according to our measure. This is because our
target of publishing frequent items coincides with the
guarantee of indistinguishability.

3.2 Attacks on k-Anonymity

We believe that in practice there is a need for
a stronger guarantee than k-anonymity, since
k-anonymity does not prevent an attacker from learning
sensitive information. We illustrate this through the
following homogeneity attack [23] against a k-query
anonymous search log. Homogeneity attacks against
a k-session anonymous or a k-keyword anonymous
search log can be constructed analogously.

Example 1. Suppose 100 different users only asked 
the query "prescription drugs under the counter
SmallTown, XY". This query is published in a 100-anonymous
search log. It is not possible to link one
of the occurrences of this query to a single user. But 
suppose only 100 of the 2000 inhabitants of the re- 
mote village SmallTown, XY have Internet access. 
Then an attacker concludes that each one of them 
shows the intention to buy prescription drugs ille- 
gally. 

Moreover, there is a discrepancy between the ideal of
k-anonymity and the actual implementations. The
concept of anonymity always refers to indistinguishability
of individuals. However, in a search log there
is no information about individuals, only about user-ids.
Since people can create multiple accounts or
share accounts, the implementations do not give rise
to anonymity of individuals. In practice, an attacker
can exploit this weakness by creating multiple accounts
and using them to link a search entry in the
output of any of these algorithms to its data-owner
as illustrated by the next example. Such an active 
attack has been carried out on social networks [3].

Example 2. An attacker wants to learn whether her
neighbor (living at address A_1), who just moved into
the town, visited the cancer hospital (at address A_2).
The attacker creates k − 1 accounts and asks the
query "from: A_1 to: A_2", with which her neighbor
might try to calculate a route with a popular search
engine. The attacker gets to see the k-anonymous
search log. In case the query "from: A_1 to: A_2"
appears in the k-anonymous search log, the attacker
concludes that her neighbor has the intention of visiting
the local cancer hospital. In practice, an attacker
might try different formulations of this query to cover
all possible ways her neighbor could express this query.

This very simple and effective attack can be applied
to a k-query anonymous or k-session anonymous
search log. The attack shows that variations of
k-anonymity do not actually prevent an attacker from
linking sensitive information to an individual. Thus
they fail to protect anonymity against active attacks.
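
The active attack of Example 2 can be simulated against a simplified k-query anonymous release. The sketch below is illustrative only: the helper `published_queries`, the account names and the log layout of (user-id, query) pairs are assumptions made for this example.

```python
def published_queries(log, k):
    """Queries that survive k-query anonymity: issued by at least k distinct user-ids."""
    users = {}
    for user_id, query in log:
        users.setdefault(query, set()).add(user_id)
    return {q for q, ids in users.items() if len(ids) >= k}

k = 5
target = "from: A_1 to: A_2"

# The attacker submits the target query from k - 1 accounts she controls.
fake_accounts = [(f"fake{i}", target) for i in range(k - 1)]

# Two possible worlds: the neighbor did not, or did, issue the target query.
log_without = fake_accounts + [("neighbor", "weather")]
log_with = fake_accounts + [("neighbor", target)]

leak_without = target in published_queries(log_without, k)
leak_with = target in published_queries(log_with, k)
print(leak_without, leak_with)
```

Because the attacker contributes exactly k − 1 of the required k user-ids, the presence of the query in the output reveals whether the one remaining user issued it.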

Apart from the algorithms that have been suggested
for search logs there are more algorithms
achieving variants of k-anonymity that could be applied
to search logs. Multi-relational k-anonymity
can be applied as a complement to k-query anonymity
to publish clicks of a search log by encoding them as
set-valued attributes [28]. FreeForm-anonymity can
be used to strengthen k-anonymity by considering
more attributes as sensitive (not just the user-id) [3B].
However, all these extensions are still vulnerable to
the active attack. Indeed, any anonymization algorithm
that tries to achieve
indistinguishability at the user-id level instead of
the individual level is vulnerable to the active attack.
In practice, it is difficult to guarantee indistinguishability
at the individual level because individuals
can create multiple accounts and log in using
different IP addresses.

Next, we will study privacy guarantees that are 
neither susceptible to the homogeneity attack nor to 
the active attack. 

4 Algorithm — Privacy 

The question is how we can publish a search log that
preserves differential privacy (as in Definition 2).
Unfortunately, we show in the next section that it is
impossible to guarantee differential privacy and provide
some utility. As a way out of this misery we
relax the privacy requirement a little bit and show in
Section 5 that we can achieve it while providing good utility.

4.1 Impossibility to Achieve Differential Privacy and Good Utility

In this section we will show that in order to achieve
good utility it is actually necessary to relax
ε-differential privacy. In the next section we will discuss
relaxations that actually achieve good utility.
This separation result reveals an interesting relation
between ε-differential privacy and its probabilistic relaxation.
It illustrates the exorbitant gain in utility
when a small probability of a privacy breach can be
tolerated. We measure the utility for publishing frequent
items as defined in Section 2.3.1.

We consider a discrete domain D. A database S
contains for each user a set of at most m values in D.
The number of users is denoted by U and assumed
to be fixed. Thus, we can represent a database as
a (U · m)-dimensional vector over D ∪ {⊥}. We can
think of D as the set of keywords, queries, clicks, or
consecutive queries and S as a search log for which
we try to retain the frequent items.

The next lemma analyzes the balance between
retaining very frequent items and filtering out
very infrequent items. No ε-differentially private algorithm
can be good at both for all input databases.
In particular, if an algorithm outputs one very fre- 
quent item d of a database then this item will also 
show up with some probability in the output of 
databases where the item is very infrequent. Thus 
if an algorithm manages to improve the accuracy of 
retaining a frequent item over the baseline algorithm 
then this will induce inaccuracy of filtering out that 
item in any database where it is infrequent. 

Lemma 1. Consider an ε-differentially private algorithm
A that retains a very frequent item d ∈ D
of some database S ∈ D^{Um} with probability p. Then
d will also be included in the output of any database
S' ∈ D^{Um} with probability at least p / e^{ε · L_1(S, S')}, even
if that item is very infrequent in that database. Here
L_1(S, S') denotes the L_1 distance between S and
S', i.e., L_1(S, S') = |frequency of d in S − frequency of d in S'|.

The lemma is a direct consequence of the
privacy definition.
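
The step from the privacy definition to Lemma 1 can be spelled out as follows. This is a sketch under the assumption that S can be turned into S' through a chain of t neighboring search logs S = S_0, S_1, ..., S_t = S', where t plays the role of the distance in the lemma; the chain and the symbol t are introduced here for illustration. Summing the inequality of Definition 2 over all outputs O that contain d and applying it once per link of the chain gives:

```latex
\begin{align*}
p = \Pr[d \in A(S_0)]
  &\le e^{\epsilon}\,\Pr[d \in A(S_1)] \\
  &\le e^{2\epsilon}\,\Pr[d \in A(S_2)] \\
  &\;\;\vdots \\
  &\le e^{t\epsilon}\,\Pr[d \in A(S_t)] = e^{t\epsilon}\,\Pr[d \in A(S')],
\end{align*}
\qquad\text{hence}\qquad
\Pr[d \in A(S')] \ge \frac{p}{e^{t\epsilon}} .
```

The bound degrades only by a factor of e^ε per step, which is why an item that is retained with noticeable probability in one database must also appear with non-negligible probability in databases where it is very infrequent.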

As a corollary we have that any ε-differentially private
algorithm that is accurate for the very frequent
items will be bad at filtering out very infrequent items.

Corollary 1. Consider an accuracy constant c, a
threshold τ, a slack ξ and a very large domain D of
size > (e^{ε(τ+ξ)}/c + 1). Any ε-differentially private
algorithm A that is c-accurate for the very frequent
items as defined in Definition 6 will be bad
at filtering out very infrequent items in all databases.
In particular, its inaccuracy for any input database is
greater than the inaccuracy of the baseline algorithm
that always outputs the empty set.

Proof. For contradiction assume that such an algorithm
exists. Call this algorithm A'. Fix some input
S. For each item d ∈ D construct S_d by changing
τ + ξ of the items to d. That way d is very frequent
(with frequency at least τ + ξ) and L_1(S, S_d) = τ + ξ.
By Definition 6 we have that

\[ \Pr[d \in A'(S_d)] \ge c. \]

By Lemma 1 it follows that the probability of
outputting d is at least c / e^{ε(τ+ξ)} for any input
database. Hence, if we sum up this probability over
all possible values d ∈ D that are infrequent in S we
obtain (^\V\ - ^) c/(e"(^+«)). This means that for 

any database S the inaccuracy of filtering out the
infrequent items is worse than that of the baseline algorithm
for very large domains. □

In search logs we are dealing with large domains.
For example, publishing consecutive query pairs for
which each query contains at most 3 keywords from
a limited vocabulary of 900,000 words results in a
domain size of 5.3 × 10^35. Corollary 1 implies that
any ε-differentially private algorithm that is 0.01-accurate
for very frequent queries provides less utility than
the algorithm that outputs the empty set if we consider
typical parameters of τ + ξ = 50, m = 10, U =
1,000,000, ε = 1. In the next section we are going to
relax differential privacy to overcome this impossibility
result.
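
The orders of magnitude in this example can be checked with a short calculation. The sketch below rests on assumptions of ours rather than on the paper's exact bookkeeping: the domain is approximated by ordered pairs of three-keyword queries, at most U·m distinct items can occur in a database, and the baseline errs on fewer than U·m/(τ+ξ) very frequent items.

```python
import math

vocabulary = 900_000
keywords_per_query = 3
# Ordered pairs of queries, each approximated as a sequence of 3 keywords.
domain_size = (vocabulary ** keywords_per_query) ** 2

c, eps, tau_plus_xi = 0.01, 1.0, 50
m, U = 10, 1_000_000

# Lemma 1 applied as in the proof of Corollary 1: every item of the domain is
# output with at least this probability, whatever the input database is.
per_item_lower_bound = c / math.exp(eps * tau_plus_xi)

# At most U*m item slots exist, so at least domain_size - U*m items never occur
# in a given database; each of them is still published with the probability above.
expected_false_positives = (domain_size - U * m) * per_item_lower_bound

# The empty-set baseline errs only on very frequent items; each of those uses
# more than tau_plus_xi of the U*m slots.
baseline_upper_bound = U * m / tau_plus_xi

print(f"domain size              ~ {domain_size:.2e}")
print(f"expected false positives >= {expected_false_positives:.2e}")
print(f"baseline inaccuracy      <  {baseline_upper_bound:.2e}")
```

Even though each individual item is published with a probability of only about 2 × 10^-24, the domain is so large that the expected number of wrongly published items exceeds the baseline's worst case by several orders of magnitude.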

5 Algorithm - Prob. Privacy 

In this section we show how to achieve good utility 
in publishing frequent keywords, queries, etc. of a 
search log while at the same time guaranteeing a re- 
laxed version of differential privacy. 

This section reviews a simple privacy-preserving
algorithm for publishing histograms from a search
log that has been independently developed by Korolova et al. [20]
and us [13]. We call this algorithm ZEALOUS.
Korolova et al. [20] offer an analysis of (ε', δ')-indistinguishability,
while we present an analysis of
(ε, δ)-probabilistic differential privacy. Later, in Section 6,
we compare the two definitions and show
that (ε, δ)-probabilistic differential privacy is strictly
stronger.

One of the advantages of ZEALOUS is its simplicity:
It uses a two-step process to eliminate the tail of
the search log, i.e., the keywords with low counts, to
achieve a strong privacy guarantee. We give the pseudocode
of ZEALOUS next; Figure 1 gives a pictorial
description of ZEALOUS. For now, think of items as
either keywords or queries. More details follow the
algorithm.

Algorithm ZEALOUS for Publishing Frequent
Items of a Search Log

Input: Search log S, positive numbers m, λ, τ, τ'

1. For each user u select a set s_u of up to m distinct
items from u's search history in S.

2. Create the item histogram of pairs (k, c_k) from
the selected items. For each item k we report
the number of users c_k such that k occurs in their
search history s_u. We call this histogram the
original histogram.

3. Delete from the histogram the pairs (k, c_k) with
count c_k less than τ.

4. For each pair (k, c_k) in the histogram sample
a random number η_k from the distribution
Lap(λ)² and add it to the count, resulting in a
noisy count: c̃_k ← c_k + η_k.

5. Delete from the histogram the pairs (k, c̃_k) with
noisy counts c̃_k that are no more than τ'.

6. Publish the remaining items and their noisy
counts. We call this histogram the sanitized
histogram.
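
The six steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary input format and the choice of selecting each user's items uniformly at random in Step 1 are assumptions (the pseudocode only requires up to m distinct items per user). Laplace noise is sampled as the difference of two exponential variates, which has the Lap(λ) distribution.

```python
import random
from collections import Counter

def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def zealous(user_items, m, lam, tau, tau_prime, seed=None):
    """user_items: {user_id: iterable of items}. Returns {item: noisy count}."""
    rng = random.Random(seed)
    counts = Counter()
    for items in user_items.values():
        distinct = sorted(set(items))
        # Step 1: keep up to m distinct items per user (random choice: assumption).
        selected = rng.sample(distinct, min(m, len(distinct)))
        # Step 2: each user contributes at most 1 to the count of an item.
        counts.update(selected)

    sanitized = {}
    for item, count in counts.items():
        if count < tau:                        # Step 3: first threshold
            continue
        noisy = count + laplace(lam, rng)      # Step 4: add Laplace noise
        if noisy > tau_prime:                  # Step 5: second threshold
            sanitized[item] = noisy
    return sanitized                           # Step 6: publish

# Hypothetical input: 200 users searched for "flu", one user for "rare".
users = {f"u{i}": ["flu", "flu"] for i in range(200)}
users["x"] = ["rare"]
result = zealous(users, m=2, lam=1.0, tau=5, tau_prime=10, seed=1)
print(result)
```

In this toy run the item "rare" is removed by the first threshold before any noise is added, while "flu" is published with a count close to 200.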

Steps 1, 2, and 4 of the algorithm are fairly standard.
It is known that adding Laplacian noise to
histogram counts achieves ε-differential privacy [12].
What is new is that we restrict the histogram to
items with counts at least τ in Step 3 in order to
be able to deal with large domains. This restriction
leaks information, and thus the output after Step 4 is
not ε-differentially private. One can show that it is
not even (ε, δ)-probabilistic differentially private (for
δ < 1/2). Step 5 disguises the information leaked in
Step 3 in order to achieve probabilistic differential
privacy.
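
The claim that Laplacian noise protects a single count can be checked numerically. The sketch below compares the output densities of two noisy counts that differ by one and confirms that their ratio never exceeds e^{1/λ} on a grid of outputs; the concrete numbers are arbitrary. As a hedged side remark on our part, replacing one user's set of up to m items changes at most 2m counts by one each, which is the usual reason for a noise scale of the form λ ≥ 2m/ε.

```python
import math

def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution with the given mean and scale."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)

scale = 2.0                      # the noise parameter lambda
bound = math.exp(1.0 / scale)    # e^(1/lambda) for counts that differ by one

# Largest density ratio between noisy versions of the counts 7 and 8,
# evaluated on a grid of possible outputs between -20 and 40.
worst = max(
    laplace_pdf(x / 10.0, 7, scale) / laplace_pdf(x / 10.0, 8, scale)
    for x in range(-200, 400)
)
print(worst, bound)
```

The ratio reaches the bound for outputs below the smaller count and stays below it elsewhere, up to floating-point rounding.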



² The Laplace distribution with scale parameter λ has the
probability density function (1/(2λ)) e^{−|x|/λ}.

Figure 1: Privacy-Preserving Algorithm.



5.1 Indistinguishability Analysis 

We can use ZEALOUS to publish frequent keywords,
queries, or consecutive query pairs. To publish clicks,
Korolova et al. [20] suggest first determining the
frequent queries and then publishing noisy counts of
the clicks to their top-100 ranked documents. If we
choose the noise of the click counts also to be 2m/ε,
then publishing frequent queries and their click distribution
is (2ε, δ)-private.



Theorem 1 [20]. Given a search log S and positive
numbers m, τ, τ', and λ, ZEALOUS achieves (ε', δ')-indistinguishability
if

\[ \lambda \ge 2m/\epsilon', \qquad (1) \]

\[ \tau = 1, \text{ and} \qquad (2) \]

\[ \tau' \ge m\left(1 - \frac{\log(2\delta'/m)}{\epsilon'}\right). \qquad (3) \]

It is recommended to set δ' < 1/U, where U denotes
the number of users.

5.2 Probabilistic Diff. Privacy Analysis

The following theorem tells us how to set the parameters
λ and τ', given values for ε, δ, τ and m, such that
ZEALOUS guarantees (ε, δ)-probabilistic differential
privacy.

Theorem 2. Given a search log S and positive numbers
m, τ, τ', and λ, ZEALOUS achieves (ε, δ)-probabilistic
differential privacy if

λ ≥ 2m/ε, and (4)



τ' − τ ≥ max




(5) 

where U denotes the number of users in S.

The proof can be found in Appendix A.
Next, we will consider a few particular parameter
settings to get a quantitative feeling for the theorems
of this section.

5.3 Quantitative Comparison of Prob. Diff. Privacy and Indistinguishability for ZEALOUS

For fixed noise and threshold parameters, the analysis of [20] stated in Theorem 1 lets us determine what level of (ε′, δ′)-indistinguishability can be guaranteed, and our analysis stated in Theorem 2 lets us determine what level of (ε, δ)-probabilistic differential privacy can be guaranteed. The results can be found in Table 1. It gives us an impression of how the two privacy guarantees compare with respect to one particular algorithm. We fixed the number of users to U = 500k and their contributions to m = 5. This is a typical setting that we will explore in the experiments.

First we want to remark that the ε parameter is the same for indistinguishability and probabilistic differential privacy. This parameter is inversely proportional to the noise λ. This illustrates the tradeoff between utility and privacy: For higher values of λ the utility decreases, but the privacy guarantees get stronger. Similarly, with increasing τ′ the utility decreases since fewer items are being published, but δ and δ′ decrease, yielding stronger privacy guarantees.

Not all settings in this table give good privacy guarantees. For example, consider τ′ = 50, λ = 5. Here our analysis does not allow us to bound the probability of a privacy breach: δ = 1. Also, (2, 3.1 × 10^-4)-indistinguishability is considered an insufficient guarantee since it is recommended to set δ′ < 1/U, where U denotes the number of users [20]. In our case δ′ = 3.1 × 10^-4 > 1/U when we consider U = 500,000. Hence, according to the two analyses it cannot be recommended to publish the query histogram. We observe that δ > δ′ in every setting of Table 1. Indeed, in the next section we will see that probabilistic differential privacy is a stronger guarantee than indistinguishability.
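The entries of Table 1 can be approximated by inverting the two bounds for a fixed noise scale. The sketch below is an illustration under assumptions: for δ it inverts the second term of Equation (5) with τ = 1 and caps the result at 1, and for δ′ it uses the noise-scale form δ′ = (m/2)·exp((m − τ′)/λ) of the bound from [20], which is adopted here because it matches the δ′ rows of the table. The function names are hypothetical.

```python
import math

def delta_prob_dp(tau_prime, lam, m, num_users, tau=1):
    # Smallest delta for which the second term of Equation (5) holds
    # (assumes the first term of the max is already satisfied).
    return min(1.0, num_users * m / (2.0 * tau) * math.exp(-(tau_prime - tau) / lam))

def delta_indist(tau_prime, lam, m):
    # Assumed noise-scale form of the indistinguishability bound.
    return min(1.0, m / 2.0 * math.exp((m - tau_prime) / lam))

U, m = 500_000, 5
for lam in (1.0, 5.0):
    for tau_prime in (50, 100, 150, 200):
        d = delta_prob_dp(tau_prime, lam, m, U)
        d_ind = delta_indist(tau_prime, lam, m)
        note = "ok" if d_ind < 1.0 / U else "delta' is not below 1/U"
        print(f"lambda={lam} tau'={tau_prime}: delta={d:.1e} delta'={d_ind:.1e} ({note})")
```

For λ = 5 and τ′ = 50 the capped value δ = 1 signals that the analysis gives no bound, and δ′ is larger than 1/U, which is the case singled out in the text.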

5.4 Utility Analysis 

Next we analyze the accuracy of the ZEALOUS algorithm for publishing frequent items and itemsets.

Theorem 3. Consider running ZEALOUS with parameters τ = τ* − ξ, τ′ = τ* + ξ and noise λ on some input DB.

The inaccuracy with slack ξ is

\[ \sum_{d \in D:\, f_d(DB) > \tau^* + \xi} \tfrac{1}{2} e^{-\xi/\lambda} \;+\; \sum_{d \in D:\, f_d(DB) < \tau^* - \xi} 0 . \]

In particular, this means that ZEALOUS is (1 − 1/2 e^(−ξ/λ))-accurate for the very frequent items (of frequency ≥ τ* + ξ) and it provides perfect accuracy for the very infrequent items (of frequency < τ* − ξ).

Proof. It is easy to see that ZEALOUS filters out infrequent items with perfect accuracy. Moreover, the probability of outputting a frequent item is at least

\[ 1 - \tfrac{1}{2} e^{-\xi/\lambda}, \]

which is the probability that the Lap(λ)-distributed noise that is added to the count is at least −ξ. In this case a frequent item with count at least τ* + ξ remains in the output of the algorithm. This probability is at least 1/2. Hence, this algorithm is better at retaining the frequent items. All in all it has higher accuracy than the basic algorithm on all inputs on which the basic algorithm has sub-optimal accuracy (i.e., on all inputs with frequent items). □

Privacy Guarantee       τ′ = 50              τ′ = 100             τ′ = 150             τ′ = 200
λ = 1 (ε, ε′ = 10)      δ  = 6.6 × 10^-16    δ  = 1.3 × 10^-37    δ  = 2.5 × 10^-59    δ  = 4.7 × 10^-81
                        δ′ = 7.2 × 10^-20    δ′ = 1.4 × 10^-41    δ′ = 2.7 × 10^-63    δ′ = 5.2 × 10^-85
λ = 5 (ε, ε′ = 2)       δ  = 1               δ  = 3.2 × 10^-3     δ  = 1.5 × 10^-7     δ  = 6.5 × 10^-12
                        δ′ = 3.1 × 10^-4     δ′ = 1.4 × 10^-8     δ′ = 6.4 × 10^-13    δ′ = 2.9 × 10^-17

Table 1: (ε′, δ′)-indistinguishability vs. (ε, δ)-probabilistic differential privacy of releasing query counts. U = 500k, m = 5.

This completes our separation result. 

Theorem 4 (Separation Result). Our (ε, δ)-probabilistic differentially private algorithm ZEALOUS is able to retain frequent items with probability at least 1/2 while filtering out all infrequent items. On the other hand, any ε-differentially private algorithm which is able to retain frequent items with non-zero probability (independent of the input database) will be more inaccurate for large domains than the baseline algorithm which always outputs the empty set.

In the impossibility result of Section 4 we saw that every differentially private algorithm that retains very frequent items has to output any item with some small probability even when it is infrequent. In a large domain such a procedure will hide the true very frequent items and destroy the utility of the data. The reason why ZEALOUS can deal better with large domains is the filtering in Step 3, which ensures perfect accuracy in filtering out infrequent items.
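For completeness, the retention probability used in the proof of Theorem 3 follows directly from the Laplace density with scale λ. The calculation below is a routine step added for the reader; it assumes ξ ≥ 0.

```latex
\begin{align*}
\Pr[\mathrm{Lap}(\lambda) \ge -\xi]
  &= 1 - \int_{-\infty}^{-\xi} \frac{1}{2\lambda}\, e^{-|x|/\lambda}\, dx
   = 1 - \int_{-\infty}^{-\xi} \frac{1}{2\lambda}\, e^{x/\lambda}\, dx \\
  &= 1 - \tfrac{1}{2}\, e^{-\xi/\lambda} \;\ge\; \tfrac{1}{2}.
\end{align*}
```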

6 Comparing Indistinguishability with Prob. Differential Privacy

In this section we study the relationship between (ε, δ)-probabilistic differential privacy and (ε′, δ′)-indistinguishability. First we will prove that probabilistic differential privacy implies indistinguishability. Then we will show that the converse is not true. We show that there exists an algorithm that is (ε′, δ′)-indistinguishable yet blatantly non-private (in the sense of both ε-differential privacy and (ε, δ)-probabilistic differential privacy). This fact might convince a data publisher to strongly prefer an algorithm that achieves (ε, δ)-probabilistic differential privacy over one that is only known to achieve (ε′, δ′)-indistinguishability. It also might convince researchers to analyze the probabilistic privacy guarantee of algorithms that are only known to be indistinguishable, such as the algorithms in [11] or [29].

First we show that our definition implies (ε, δ)-indistinguishability.

Proposition 1. If an algorithm A is (ε, δ)-probabilistic differentially private then it is also (ε, δ)-indistinguishable.

The proof can be found in Appendix B.1. The converse of Proposition 1 does not hold, as illustrated in the next example.

6.1 A Separation of Indistinguishability and Prob. Differential Privacy

We will give an algorithm that is (ε′, δ′)-indistinguishable but not (ε, δ)-probabilistic differentially private for any choice of ε and δ < 1. We already know from the previous section that probabilistic differential privacy offers a guarantee that is always at least as strong as indistinguishability. Now, we will see that in some cases it can actually provide a much stronger guarantee.

Example 3. Consider the following algorithm that takes as input a search log S with a search history for each of the U users. Let us assume that these search histories all come from a finite domain D of search histories of up to a certain length. This algorithm is given search histories for all users and it is going to return a search history that is unequal to the first user's search history. Any search history with that property is returned with equal probability.

Algorithm A
Input: Search log S ∈ D^U

1. Sample uniformly at random a single search history from the set of all histories excluding the first user's search history.

2. Return this search history.

The following proposition analyzes the privacy of Algorithm A.

Proposition 2. For any finite domain of search histories D, Algorithm A is (ε′, 1/(|D| − 1))-indistinguishable for all ε′ > 0 on inputs from D^U.

The proof can be found in Appendix B.2. The next proposition shows that every single output of the algorithm constitutes a privacy breach.

Proposition 3. For any search log S, the output of Algorithm A constitutes a privacy breach according to ε-differential privacy for any value of ε.

Proof. Fix an input S and an output O that is different from the search history of the first user. Consider the input S′ differing from S only in the first user's history, where S′_1 = O. Here

\[ 1/(|D| - 1) = \Pr[A(S) = O] \ne \Pr[A(S') = O] = 0 . \]

Thus the output O breaches the privacy of the first user according to ε-differential privacy. □
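The argument in the proof can be checked on a toy domain. The sketch below computes the exact output distribution of Algorithm A; the four-element domain and the history labels are hypothetical, and only the first user's history matters because the algorithm ignores all other users.

```python
from fractions import Fraction

def output_distribution(domain, first_history):
    """Exact output distribution of Algorithm A given the first user's history."""
    candidates = [h for h in domain if h != first_history]
    p = Fraction(1, len(candidates))
    return {h: (p if h != first_history else Fraction(0)) for h in domain}

domain = ["h1", "h2", "h3", "h4"]                 # hypothetical finite domain D
dist_S = output_distribution(domain, "h1")        # first user of S has history h1
dist_S_prime = output_distribution(domain, "h3")  # S' differs only in that history

O = "h3"
print(dist_S[O], dist_S_prime[O])  # prints: 1/3 0
```

Since the output O has probability 1/(|D| − 1) under S and probability 0 under S′, no finite ε can bound the ratio of the two.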

Corollary 2. Algorithm A is (ε′, 1/(|D| − 1))-indistinguishable for all ε′ > 0. But it is not (ε, δ)-probabilistic differentially private for any ε > 0 and any δ < 1.

It was advised to set δ′ smaller than the inverse of the number of users when setting the parameters for an indistinguishable algorithm. Moreover, our corollary states that even if you set δ′ = 1/(|D| − 1) the probability that a privacy breach happens cannot be bounded. We would recommend setting δ′ smaller than the inverse of the domain size.

τ     1          3          5          7          9
τ′    81.1205    78.7260    78.6827    79.3368    80.3316

Table 2: τ′ as a function of τ for m = 2, ε = 1, δ = 0.01

7 Choosing Parameters 

Apart from the privacy parameters ε and δ, ZEALOUS requires the data publisher to specify two more parameters: τ, the first threshold used to eliminate keywords with low counts (Step 3), and m, the number of contributions per user. These parameters affect both the noise added to each count as well as the second threshold τ′. Before we discuss the choice of these parameters we explain the general set-up of our experiments.

Data. In our experiments we work with a search log 
of user queries from the Yahoo! search engine col- 
lected from 500,000 users over a period of one month. 
This search log contains about one million distinct 
keywords, three million distinct queries, three million 
distinct query pairs, and 4.5 million distinct clicks. 

Privacy Parameters. In all experiments we set δ = 0.001. Thus the probability that the output of ZEALOUS could breach the privacy of any user is appropriately small. We explore different levels of (ε, δ)-probabilistic differential privacy by varying ε.

7.1 Choosing Threshold τ

We would like to retain as much information as possible in the published search log. A smaller value for τ′ immediately leads to a histogram with higher utility because fewer items and their noisy counts are filtered out in the last step of ZEALOUS. Thus if we choose τ in a way that minimizes τ′ we maximize the utility of the resulting histogram. Interestingly, choosing τ = 1 does not necessarily minimize the value of τ′. Table 2 presents the value of τ′ for different values of τ for m = 2 and ε = 1. As we can see, for our parameter setting τ′ is minimized if τ = 4. We can show the following optimality result which tells us how to choose τ optimally in order to maximize utility.

Proposition 4. For fixed ε, δ and m, choosing τ = ⌈2m/ε⌉ minimizes the value of τ′.

The proof follows from taking the derivative of τ′ as a function of τ (based on Equation (5)) to determine its minimum.
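The derivative argument can be spelled out as follows. This is a sketch under two assumptions: the binding term of Equation (5) is the one involving δ, so that τ′ = τ − λ ln(2δτ/(U·m)), and the noise scale is λ = 2m/ε; τ is first treated as a real number.

```latex
\begin{align*}
\tau'(\tau) &= \tau - \lambda \ln\frac{2\delta\,\tau}{U m}
             = \tau - \lambda \ln \tau + \lambda \ln\frac{U m}{2\delta}, \\
\frac{d\tau'}{d\tau} &= 1 - \frac{\lambda}{\tau} = 0
  \iff \tau = \lambda = \frac{2m}{\epsilon}, \qquad
\frac{d^2\tau'}{d\tau^2} = \frac{\lambda}{\tau^2} > 0 .
\end{align*}
```

For m = 2 and ε = 1 this gives τ = 4, in line with the minimum reported for that parameter setting.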

7.2 Choosing the Number of Contributions m

Proposition 4 tells us how to set τ in order to maximize utility. Next we will discuss how to set m optimally. We will do so by studying the effect of varying m on the coverage and the precision of the top-j most frequent items in the sanitized histogram. The top-j coverage of a sanitized search log is defined as the fraction of distinct items among the top-j most frequent items in the original search log that also appear in the sanitized search log. The top-j precision of a sanitized search log is defined as the distance between the relative frequencies in the original search log versus the sanitized search log for the top-j most frequent items. In particular, we study two distance metrics between the relative frequencies: the average L-1 distance and the KL-divergence.
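The two measures can be sketched in a few lines. The text does not say how relative frequencies are normalized or how items missing from the sanitized log enter the distances, so the sketch below makes an explicit assumption: distances are computed over the top-j items that survive sanitization, with both histograms renormalized over that set. All function names are hypothetical.

```python
import math

def top_j(hist, j):
    # Items sorted by decreasing count; ties broken by item name for determinism.
    return [it for it, _ in sorted(hist.items(), key=lambda kv: (-kv[1], kv[0]))[:j]]

def coverage(original, sanitized, j):
    top = top_j(original, j)
    return sum(1 for it in top if it in sanitized) / len(top)

def _relative(hist, items):
    total = sum(hist[it] for it in items)
    return [hist[it] / total for it in items]

def precision(original, sanitized, j):
    """Return (average L1 distance, KL divergence) over the covered top-j items."""
    items = [it for it in top_j(original, j) if sanitized.get(it, 0) > 0]
    p, q = _relative(original, items), _relative(sanitized, items)
    avg_l1 = sum(abs(a - b) for a, b in zip(p, q)) / len(items)
    kl = sum(a * math.log(a / b) for a, b in zip(p, q))
    return avg_l1, kl
```

The sketch raises a division error when none of the top-j items is covered; a caller would have to handle that case separately.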

As a first study of the coverage, Table 3 shows the number of distinct items (recall that items can be keywords, queries, query pairs, or clicks) in the sanitized search log as m increases. We observe that coverage decreases as we increase m. Moreover, the decrease in the number of published items is more dramatic for larger domains than for smaller domains. The number of distinct keywords decreases by 55% while at the same time the number of distinct query pairs decreases by 96% as we increase m from 1 to 40. This trend has two reasons. First, from Theorem 2 and Proposition 4 we see that threshold τ′ increases super-linearly in m. Second, as m increases the number of keywords contributed by the users increases only sub-linearly in m; fewer users are able to supply m items for increasing values of m. Hence, fewer items pass the threshold τ′ as m increases. The reduction is larger for query pairs than for keywords, because the average number of query pairs per user is smaller than the average number of keywords per user in the original search log (shown in Table 4).

Distinct items
m              1       ...     ...     20      40
keywords       6667    6043    ...     4062    2964
queries        3334    2087    1440    751     408
clicks         2813    1576    1001    ...     ...
query pairs    ...     ...     100     40      ...

Total counts
m              1       ...     ...     20      40
keywords       329     1157    1894    3106    3871
queries        147     314     402     464     439
clicks         118     234     286     317     290
query pairs    8       14      15      12      ...

Table 3: Number of distinct items (top) and total counts (bottom) in the sanitized search log for increasing m.

                  keywords    queries    clicks    query pairs
avg items/user    56          20         14        7

Table 4: Avg number of items per user in the original search log

To understand how m affects precision, we measure the total sum of the counts in the sanitized histogram as we increase m in Table 3. Higher total counts offer the possibility to match the original distribution at a finer grain. We observe that as we increase m, the total counts increase until a tipping point is reached after which they start decreasing again. This effect is as expected for the following reason: As m increases, each user contributes more items, which leads to higher counts in the sanitized histogram. However, the total count increases only sub-linearly with m (and even decreases) due to the reduction in coverage we discussed above. We found that the tipping point where the total count starts to decrease corresponds approximately to the average number of items contributed by each user in the original search log (shown in Table 4). This suggests that we should choose m to be smaller than the average number of items, because it offers better coverage, higher total counts and reduces the noise compared to higher values of m.

Let us take a closer look at precision and coverage of the histograms of the various domains in Figures 2 and 3. In Figure 2 we vary m between 1 and 40. Each curve plots the precision or coverage of the sanitized search log at various values of the top-j parameter in comparison to the original search log. We vary the top-j parameter but never choose it higher than the number of distinct items in the original search log for the various domains. The first two rows plot precision curves for the average L-1 distance (first row) and the KL-divergence (second row) of the relative frequencies. The lower two rows plot the coverage curves, i.e., the total (relative, respectively) number of top-j items in the original search log that do not appear in the sanitized search log in the third row (fourth row, respectively). First, observe that the coverage decreases as m increases, which confirms our discussion about the number of distinct items. Moreover, we see that the coverage gets worse for increasing values of the top-j parameter. This illustrates that ZEALOUS gives better utility for the more frequent items. Second, note that for small values of the top-j parameter, values of m > 1 give better precision. However, when the top-j parameter is increased, m = 1 gives better precision because the precision of the top-j values degrades due to items no longer appearing in the sanitized search log due to the increased cutoffs.

Figure 3 shows the same statistics varying the top-j parameter on the x-axis. Each curve plots the precision for m = 1, 2, 4, 8, 10, 40, respectively. Note that m = 1 does not always give the best precision; for keywords, m = 8 has the lowest KL-divergence, and for queries, m = 2 has the lowest KL-divergence. As we can see from these results, there are two "regimes" for setting the value of m. If we are mainly interested in coverage, then m should be set to 1. However, if we are only interested in a few top-j items then we can increase precision by choosing a larger value for m; in this case we recommend the average number of items per user.

We will see this dichotomy again in our real applications of search log analysis: The index caching application does not require high coverage because of its storage restriction. However, high precision of the top-j most frequent items is necessary to determine which of them to keep in memory. On the other hand, in order to generate many query substitutions a larger number of distinct queries and query pairs is required. Thus m should be set to a large value for index caching and to a small value for query substitution.

8 Application-Oriented Evaluation

In this section we compare the utility provided by algorithms guaranteeing privacy and anonymity. As a baseline we consider publishing the original search log. To be clear, our utility evaluation is not supposed to determine the better algorithm. When choosing an algorithm in practice one has to consider both the utility and the disclosure limitation guarantee of an algorithm. We have compared the guarantees offered by k-anonymity, differential privacy and its relaxation in the preceding sections. In this section we study the price we have to pay (in terms of decreased utility) when guaranteeing privacy as opposed to utility.

Algorithms. We experimentally compare the utility of ZEALOUS against a representative k-anonymity algorithm [1] for publishing search logs. This algorithm creates a k-query anonymous search log as follows: First all queries that are posed by fewer than k distinct users are eliminated. Then histograms of keywords, queries, and query pairs from the k-query anonymous search log are computed.
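The filtering step of this baseline can be sketched as follows. The representation of the log as a list of (user, query) pairs, whitespace tokenization of keywords, and the function names are assumptions of this example; query pairs are omitted because they require session order, which the sketch does not model.

```python
from collections import Counter, defaultdict

def k_query_anonymous(log, k):
    """log: list of (user_id, query). Keep queries posed by at least k distinct users."""
    users_per_query = defaultdict(set)
    for user, query in log:
        users_per_query[query].add(user)
    frequent = {q for q, users in users_per_query.items() if len(users) >= k}
    return [(u, q) for u, q in log if q in frequent]

def histograms(log):
    # Query and keyword histograms of an (already filtered) log.
    queries = Counter(q for _, q in log)
    keywords = Counter(w for _, q in log for w in q.split())
    return queries, keywords
```

Note that the threshold counts distinct users, not occurrences, so a query repeated many times by a single user is still removed.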

ZEALOUS can be used to achieve (ε′, δ′)-indistinguishability as well as (ε, δ)-probabilistic differential privacy. For ease of presentation we only document the probabilistic differential privacy guarantee. But using Theorems 1 and 2 it is straightforward to compute the corresponding indistinguishability guarantee. For brevity, we refer to the (ε, δ)-probabilistic differentially private algorithm as ε-Differential in the figures.

Evaluation Metrics. We evaluate the performance of the algorithms in two ways. First, we measure how well the output of the algorithms preserves certain statistics of the original search log. Second, we pick two real applications from the information retrieval community to evaluate the utility of ZEALOUS: Index caching as a representative application for search performance, and query substitution as a representative application for search quality. This will help us to fully understand the performance of ZEALOUS in an application context.

Figure 2: Effect on stats by varying m for several values of top-j.

We first describe our utility evaluation with statistics in Section 8.1 and then with real applications in Sections 8.2 and 8.3.

8.1 General Statistics 

We explore different statistics that measure the difference of sanitized histograms to the histograms computed using the original search log. We analyze the histograms of keywords, queries, and query pairs for both sanitization methods. For clicks we only consider ZEALOUS histograms since a k-query anonymous search log is not designed to publish click data.

Figure 3: Effect on Statistics Of Varying j in top-j for Different Values of m.

In our first experiment we compare the distribution of the counts in the histograms. Note that a k-query anonymous search log will never have query and keyword counts below k, and similarly a ZEALOUS histogram will never have counts below τ′. We choose ε = 5, m = 1, for which threshold τ′ ≈ 10. Therefore we deliberately set k = 10 such that k ≈ τ′ for a comparable setting.

Figure 4 shows the distribution of the counts in the histograms on a log-log scale. We see that the power-law shape of the distribution is well preserved. However, the total frequencies are lower for the sanitized search logs than the frequencies in the original histogram because our sanitization methods filter out user contributions. We also see the cutoffs created by k and τ′. We observe that as the domain increases from keywords to clicks and query pairs the number of infrequent items becomes larger for the original search log. For example, the number of clicks with count one is an order of magnitude larger than the number of keywords with count one.

While it is good to know that the shape of the count distribution is well preserved, we would also like to know whether the counts of frequent keywords, queries, query pairs, and clicks are also preserved and what impact the privacy parameter ε and the anonymity parameter k have.

Figure 4: Distributions of counts in the histograms.

Figure 5: Average difference between counts in the original histogram and the probabilistic differential privacy-preserving histogram, and the anonymous histogram for varying privacy / anonymity parameters ε and k. Parameter m is fixed to 1.

Figure 5 shows the average differences to the counts in the original histogram. We scaled up the counts in sanitized histograms by a common factor so that the total counts were equal to the total counts of the original histogram, then we calculated the average difference between the counts. The average is taken over all keywords that have non-zero count in the original search log. As such this metric takes both coverage and precision into account.
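The rescaled average difference described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that both histograms are dictionaries from item to count, that the original histogram holds only items with non-zero count, and that the sanitized total is positive.

```python
def average_difference(original, sanitized):
    """Average |original count - rescaled sanitized count| over items of the original."""
    # Common factor that makes the sanitized total equal the original total.
    scale = sum(original.values()) / sum(sanitized.values())
    diffs = [abs(count - scale * sanitized.get(item, 0.0))
             for item, count in original.items()]
    return sum(diffs) / len(diffs)
```

An item that was filtered out contributes its full original count to the average, which is how the measure penalizes lost coverage as well as distorted counts.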

As expected, with increasing ε the average difference decreases, since the noise added to each count decreases. Similarly, by decreasing k the accuracy increases because more queries will pass the threshold. Figure 5 shows that the average difference is comparable for the k-anonymous histogram and our ZEALOUS histogram. For keywords we observe that the ZEALOUS histogram is more accurate than a k-anonymous histogram for all values of ε > 2. For queries we obtain roughly the same average difference for k = 60 and ε = 6. For query pairs the k-query anonymous histogram provides better utility.

We also computed other metrics such as the root-mean-square value of the differences and the total variation difference; they all reveal similar qualitative trends. Despite the fact that ZEALOUS disregards many search log records (by throwing out all but m contributions per user and by throwing out low-frequency counts), ZEALOUS is able to preserve the overall distribution well.

8.2 Index Caching 

Search engines maintain an inverted index which, in its simplest instantiation, contains for each keyword a posting list of identifiers of the documents in which the keyword appears. This index can be used to answer search queries, but also to classify queries for choosing sponsored search results. The index is too large to fit in memory, but maintaining a part of it in memory reduces response time for all these applications. In the index caching problem, we aim to store in memory a set S of posting lists that maximizes the hit-probability over all keywords (the formulation of the problem is from Baeza-Yates [3]). Given such a set S and a probability distribution over the likelihood of occurrence of keywords in a query, the hit-probability is the sum of the likelihoods of the keywords whose posting lists are kept in memory.

Figure 6: Hit probabilities.

In our experiments, we use an improved version of the algorithm developed by Baeza-Yates to decide which posting lists should be kept in memory [3]. This algorithm first assigns each keyword a score, which equals its frequency in the search log divided by the number of documents that contain the keyword. Keywords are chosen using a greedy bin-packing strategy where we sequentially add posting lists from the keywords with the highest score until the memory is filled. In our experiments we fixed the memory size to be 1 GB, and each document posting to be 8 Bytes (other parameters give comparable results). Our inverted index stores the document posting list for each keyword sorted according to relevance, which allows retrieving the documents in the order of their relevance. We truncate this list in memory to contain at most 200,000 documents. Hence, for an incoming query the search engine retrieves the posting list for each keyword in the query either from memory or from disk. If the intersection of the posting lists happens to be empty, then less relevant documents are retrieved from disk for those keywords for which only the truncated posting list is kept in memory.
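The greedy selection and the hit-probability can be sketched as follows. This is an illustration rather than the improved algorithm used in the experiments: the function names are hypothetical, 1 GB is taken to be 2^30 bytes, and the loop stops at the first posting list that no longer fits, which is one possible reading of "until the memory is filled".

```python
def choose_cached_keywords(freq, doc_count, memory_bytes=2**30,
                           posting_bytes=8, max_postings=200_000):
    """Greedy selection by score = search-log frequency / number of documents."""
    ranked = sorted(freq, key=lambda w: freq[w] / doc_count[w], reverse=True)
    cached, used = [], 0
    for word in ranked:
        # Posting lists are truncated to at most max_postings entries in memory.
        size = min(doc_count[word], max_postings) * posting_bytes
        if used + size > memory_bytes:
            break  # assumption: stop at the first list that no longer fits
        cached.append(word)
        used += size
    return cached

def hit_probability(cached, likelihood):
    # Sum of the occurrence likelihoods of the cached keywords.
    return sum(likelihood.get(word, 0.0) for word in cached)
```

To evaluate a sanitized histogram, `freq` would come from the sanitized counts while `likelihood` would come from the original search log.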

Figure 6 shows the hit-probabilities of the inverted index constructed using the original search log, the k-anonymous search log, and the ZEALOUS histogram with our greedy approximation algorithm. Figure 6(a) shows that our ZEALOUS histogram achieves better utility than the k-anonymous search log for a range of parameters. We note that the utility only suffers marginally when increasing the privacy parameter or the anonymity parameter (at least in the range that we have considered).

As a last experiment we study the effect of varying m on the hit-probability in Figure 6(b). We observe that the hit probability for m = 6 is above 0.36 whereas the hit probability for m = 1 is less than 0.33. As discussed, a higher value for m increases the accuracy, but reduces the coverage. Index caching really requires roughly the top 85 most frequent keywords that are still covered when setting m = 6. We also experimented with higher values of m and observed that the hit-probability decreases at some point. This confirms our earlier findings about setting the value of m. For this application, the sanitized data should accurately model the relative frequencies of the most frequent keywords in the original search log, and thus a larger value of m gives more accurate estimates of the hit-probability.

8.3 Query Substitution 

Query Substitution studies how to rephrase a user query to match it to documents or advertisements that do not contain the actual keywords of the query but contain relevant information. Query substitution has applications in query refinement, sponsored search, and spelling error correction [17]. Algorithms for query substitution examine query pairs to learn how users re-phrase queries. We use an algorithm developed by Jones et al. in which related queries for a query are identified in two steps [17]. First, the query is partitioned into subsets of keywords, called phrases, based on their mutual information. Next, for each phrase, candidate query substitutions are determined based on the distribution of queries.

We run this algorithm to generate ranked substitutions on the sanitized search logs. We then compare these rankings with the rankings produced by the original search log, which serve as ground truth. To measure the quality of the query substitutions, we compute the precision/recall, MAP (mean average precision) and NDCG (normalized discounted cumulative gain) of the top-j suggestions for each query; let us define these metrics next.

Consider a query q and its list of top-j ranked substitutions q′_0, ..., q′_{j−1} computed based on a sanitized search log. We compare this ranking against the top-j ranked substitutions q_0, ..., q_{j−1} computed based on the original search log as follows. The precision of a query q is the fraction of substitutions from the sanitized search log that are also contained in our ground truth ranking:

\[ \mathrm{Precision}(q) = \frac{|\{q_0, \ldots, q_{j-1}\} \cap \{q'_0, \ldots, q'_{j-1}\}|}{|\{q'_0, \ldots, q'_{j-1}\}|} \]

Note that the number of items in the ranking for a query q can be less than j. The recall of a query q is the fraction of substitutions in our ground truth that are contained in the substitutions from the sanitized search log:

\[ \mathrm{Recall}(q) = \frac{|\{q_0, \ldots, q_{j-1}\} \cap \{q'_0, \ldots, q'_{j-1}\}|}{|\{q_0, \ldots, q_{j-1}\}|} \]

MAP measures the precision of the ranked items for a query as the ratio of true rank and assigned rank:

\[ \mathrm{MAP}(q) = \sum_{i=0}^{j-1} \frac{i + 1}{\text{rank of } q_i \text{ in } [q'_0, \ldots, q'_{j-1}] + 1}, \]

where the rank of q_i is zero in case it is not contained in the list [q′_0, ..., q′_{j−1}]; otherwise it is i′ such that q_i = q′_{i′}.
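The three ranking measures can be written down directly from their definitions. The sketch below follows the text literally, including the convention that a missing substitution has rank zero; summing the MAP ratio over all positions i of the ground-truth ranking is an assumption of this example, and the function names are hypothetical.

```python
def precision(truth, sanitized):
    # Fraction of sanitized substitutions that also occur in the ground truth.
    return len(set(truth) & set(sanitized)) / len(set(sanitized))

def recall(truth, sanitized):
    # Fraction of ground-truth substitutions recovered from the sanitized log.
    return len(set(truth) & set(sanitized)) / len(set(truth))

def map_score(truth, sanitized):
    # Rank is the position i' with q_i = q'_{i'}, and zero when q_i is missing.
    total = 0.0
    for i, q in enumerate(truth):
        rank = sanitized.index(q) if q in sanitized else 0
        total += (i + 1) / (rank + 1)
    return total
```

Precision is undefined for an empty sanitized ranking, which matches the convention of averaging precision-type measures only over queries for which the sanitized log produces substitutions.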

Our last metric, called NDCG, measures how the relevant substitutions are placed in the ranking list. It does not only compare the ranks of a substitution in the two rankings, but it also penalizes highly relevant substitutions according to [q_0, ..., q_{j−1}] that have a very low rank in [q′_0, ..., q′_{j−1}]. Moreover, it takes the length of the actual lists into consideration. We refer the reader to the paper by Chakrabarti et al. [9] for details on NDCG.

The discussed metrics compare rankings for one query. To compare the utility of our algorithms, we average over all queries. For coverage we average over all queries for which the original search log produces substitutions. For all other metrics that try to capture the precision of a ranking, we average only over the queries for which the sanitized search logs produce substitutions. We generated query substitutions only for the 100,000 most frequent queries of the original search log since the substitution algorithm only works well given enough information about a query.

Figure 8: Coverage of the privacy-preserving histograms for m = 1 and m = 6.

In Figure 7 we vary k and ε for m = 1 and we draw the utility curves for top-j for j = 2 and j = 5. We observe that varying ε and k has hardly any influence on performance. On all precision measures, ZEALOUS provides utility comparable to k-query-anonymity. However, the coverage provided by ZEALOUS is not good. This is because the computation of query substitutions relies not only on the frequent query pairs but also on the count of phrase pairs, which record for two sets of keywords how often a query containing the first set was followed by another query containing the second set. Thus a phrase pair can have a high frequency even though all query pairs it is contained in have very low frequency. ZEALOUS filters out these low frequency query pairs and thus loses many frequent phrase pairs.

As a last experiment, we study the effect of increasing m for query substitutions. Figure 8 plots the average coverage of the top-2 and top-5 substitutions produced by ZEALOUS for m = 1 and m = 6 for various values of ε. It is clear that across the board larger values of m lead to smaller coverage, thus confirming our intuition outlined in the previous section.



Figure 7: Quality of the query substitutions of the
privacy-preserving histograms, and the anonymous search
log.

9 Beyond Search Logs

While the main focus of this paper is search logs, our
results apply to other scenarios as well. For example,
consider a retailer who collects customer transactions.
Each transaction consists of a basket of products
together with their prices, and a time-stamp. Our
results also apply to publishing frequently purchased
products or sets of products. This information can be
used in a recommender system or in a market basket
analysis to decide on the goods and promotions in a
store [14].

Another example concerns monitoring the health of
patients. Each time a patient sees a doctor, the doctor
records the diseases of the patient and the suggested
treatment. It would be interesting to publish frequent
combinations of diseases.

All of our results apply to the more general problem of
publishing frequent items / itemsets / consecutive
itemsets. Existing work on publishing frequent itemsets
often only tries to achieve anonymity or makes strong
assumptions about the background knowledge of an
attacker [30, 26, 32]. We explain how to achieve
$(\epsilon, \delta)$-probabilistic differential privacy
against all possible attackers in Section 5. We also
show that it is impossible to achieve
$\epsilon$-differential privacy and good utility at the
same time; thus this probabilistic guarantee is the
best we can hope for.

10 Conclusions 

This paper contains a comparative study of publishing
frequent keywords, queries, and clicks in search logs.
We compare the disclosure limitation guarantees and the
theoretical and practical utility of various
approaches. Our comparison includes earlier work on
anonymity and $(\epsilon', \delta')$-indistinguishability
and our proposed solution to achieve
$(\epsilon, \delta)$-probabilistic differential privacy
in search logs. These results (negative as well as
positive) apply more generally to the problem of
publishing frequent (and possibly sequential) items or
itemsets. In our comparison, we revealed interesting
relationships between different relaxations of
differential privacy, which might be of independent
interest.

A topic of future work is the development of algorithms
that allow publishing useful information about
infrequent keywords, queries, and clicks in a search
log.

A Analysis of ZEALOUS:
Proof of Theorem 2

Let $H$ be the keyword histogram constructed by ZEALOUS
in Step 2 when applied to $S$, and let $K$ be the set of
keywords in $H$ whose count equals $\tau$. Let $\Omega$
be the set of keyword histograms that do not contain any
keyword in $K$. For notational simplicity, let us denote
ZEALOUS as a function $Z$. We will prove the theorem by
showing that, given Equations (4) and (5),

\[
\Pr[Z(S) \notin \Omega] \le \delta, \tag{6}
\]

and for any keyword histogram $\omega \in \Omega$ and for
any neighboring search log $S'$ of $S$,

\[
e^{-\epsilon} \cdot \Pr[Z(S') = \omega] \le \Pr[Z(S) = \omega] \le e^{\epsilon} \cdot \Pr[Z(S') = \omega]. \tag{7}
\]

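We will first prove that Equation (6) holds. Assume that
the $i$-th keyword in $K$ has a count $c_i$ in $Z(S)$ for
$i \in [1, |K|]$. Then,

\[
\begin{aligned}
\Pr[Z(S) \notin \Omega]
&= \Pr\left[\exists i \in [1,|K|],\; c_i > \tau'\right] \\
&= 1 - \Pr\left[\forall i \in [1,|K|],\; c_i \le \tau'\right] \\
&= 1 - \prod_{i \in [1,|K|]} \int_{-\infty}^{\tau'-\tau} \frac{1}{2\lambda} e^{-|x|/\lambda}\, dx
  && \text{(the noise added to $c_i$ has to be $> \tau' - \tau$)} \\
&= 1 - \left(1 - \tfrac{1}{2} e^{-(\tau'-\tau)/\lambda}\right)^{|K|} \\
&\le \frac{|K|}{2} \cdot e^{-(\tau'-\tau)/\lambda} \\
&\le \frac{U \cdot m}{2\tau} \cdot e^{-(\tau'-\tau)/\lambda}
  && \text{(because $|K| \le U \cdot m/\tau$)} \\
&\le \frac{U \cdot m}{2\tau} \cdot e^{\ln\left(2\delta/(U \cdot m/\tau)\right)}
  && \text{(by Equation (5))} \\
&= \delta.
\end{aligned}
\tag{8}
\]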
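
The chain of inequalities in Equation (8) can be checked numerically. The sketch below assumes that the threshold condition used in the last step reads $\tau' - \tau \ge -\lambda \ln(2\delta\tau/(U \cdot m))$, which is the form the derivation relies on; the parameter values are arbitrary and chosen only for illustration.

```python
import math

def threshold_gap(lam, delta, num_users, m, tau):
    """Smallest tau' - tau with tau' - tau >= -lam * ln(2*delta*tau / (U*m))
    (assumed form of the threshold condition used in the proof)."""
    return -lam * math.log(2 * delta * tau / (num_users * m))

def prob_leaving_omega(gap, lam, k_size):
    """1 - (1 - 0.5*exp(-gap/lam))^|K|: probability that at least one of the
    |K| keywords with count tau receives noise above gap (requires gap >= 0)."""
    return 1 - (1 - 0.5 * math.exp(-gap / lam)) ** k_size

# Arbitrary illustrative parameters.
U, m, tau, lam, delta = 1000, 5, 10, 10.0, 0.001
gap = threshold_gap(lam, delta, U, m, tau)
k_max = U * m // tau                     # |K| <= U*m/tau
p = prob_leaving_omega(gap, lam, k_max)  # should not exceed delta
```

With these values the gap is roughly 124 and the probability of leaving $\Omega$ stays just below $\delta$, which shows that the bound is nearly tight when $|K|$ is as large as $U \cdot m/\tau$.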



Next, we will show that Equation (7) also holds. Let $S'$
be any neighboring search log of $S$. Let $\omega$ be any
possible output of ZEALOUS given $S$, such that
$\omega \in \Omega$. To establish Equation (7), it
suffices to prove that

\[
\frac{\Pr[Z(S) = \omega]}{\Pr[Z(S') = \omega]} \le e^{\epsilon}, \text{ and} \tag{9}
\]

\[
\frac{\Pr[Z(S') = \omega]}{\Pr[Z(S) = \omega]} \le e^{\epsilon}. \tag{10}
\]

We will derive Equation (9). The proof of (10) is
analogous.

Let $H'$ be the keyword histogram constructed by ZEALOUS
in Step 2 when applied to $S'$. Let $\Delta$ be the set
of keywords that have different counts in $H$ and $H'$.
Since $S$ and $S'$ differ in the search history of a
single user, and each user contributes at most $m$
keywords, we have $|\Delta| \le 2m$. Let $k_i$
($i \in [1, |\Delta|]$) be the $i$-th keyword in
$\Delta$, and let $d_i$, $d'_i$, and $d_i^*$ be the counts
of $k_i$ in $H$, $H'$, and $\omega$, respectively. Since a
user adds at most one to the count of a keyword (see
Step 2), we have $|d_i - d'_i| = 1$ for any
$i \in [1, |\Delta|]$. To simplify notation, let $E_i$,
$E'_i$, $E_i^*$, and ${E'_i}^*$ denote the events that
$k_i$ has count $d_i$ in $H$, count $d'_i$ in $H'$, count
$d_i^*$ in $Z(S)$, and count $d_i^*$ in $Z(S')$,
respectively. Therefore,

\[
\frac{\Pr[Z(S) = \omega]}{\Pr[Z(S') = \omega]}
= \prod_{i \in [1,|\Delta|]} \frac{\Pr[E_i^* \mid E_i]}{\Pr[{E'_i}^* \mid E'_i]}.
\]



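In what follows, we will show that
$\Pr[E_i^* \mid E_i] / \Pr[{E'_i}^* \mid E'_i] \le e^{1/\lambda}$
for any $i \in [1, |\Delta|]$. We differentiate three
cases: (i) $d_i \ge \tau$ and $d'_i \ge \tau$, (ii)
$d_i < \tau$, and (iii) $d_i = \tau$ and
$d'_i = \tau - 1$.

Consider case (i) when $d_i$ and $d'_i$ are at least
$\tau$. Then, if $d_i^* > 0$, we have

\[
\begin{aligned}
\frac{\Pr[E_i^* \mid E_i]}{\Pr[{E'_i}^* \mid E'_i]}
&= \frac{\frac{1}{2\lambda} e^{-|d_i^* - d_i|/\lambda}}{\frac{1}{2\lambda} e^{-|d_i^* - d'_i|/\lambda}} \\
&= e^{\left(|d_i^* - d'_i| - |d_i^* - d_i|\right)/\lambda} \\
&\le e^{|d_i - d'_i|/\lambda} \\
&= e^{1/\lambda}. && \text{(because $|d_i - d'_i| = 1$ for any $i$)}
\end{aligned}
\]

On the other hand, if $d_i^* = 0$,

\[
\frac{\Pr[E_i^* \mid E_i]}{\Pr[{E'_i}^* \mid E'_i]}
= \frac{\int_{-\infty}^{\tau' - d_i} \frac{1}{2\lambda} e^{-|x|/\lambda}\, dx}{\int_{-\infty}^{\tau' - d'_i} \frac{1}{2\lambda} e^{-|x|/\lambda}\, dx}
\le e^{1/\lambda}.
\]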



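Now consider case (ii) when $d_i$ is less than $\tau$.
Since $\omega \in \Omega$, and ZEALOUS eliminates all
counts in $H$ that are smaller than $\tau$, we have
$d_i^* = 0$ and $\Pr[E_i^* \mid E_i] = 1$. On the other
hand,

\[
\Pr[{E'_i}^* \mid E'_i] =
\begin{cases}
1, & \text{if } d'_i < \tau \\
1 - \tfrac{1}{2} e^{-|\tau' - d'_i|/\lambda}, & \text{otherwise.}
\end{cases}
\]

Therefore,

\[
\begin{aligned}
\frac{\Pr[E_i^* \mid E_i]}{\Pr[{E'_i}^* \mid E'_i]}
&\le \frac{1}{1 - \tfrac{1}{2} e^{-(\tau' - d'_i)/\lambda}} \\
&\le \frac{1}{1 - \tfrac{1}{2} e^{-(\tau' - \tau)/\lambda}} \\
&\le \frac{1}{1 - \tfrac{1}{2} e^{\ln\left(2 - 2e^{-1/\lambda}\right)}}
  && \text{(by Equation (5))} \\
&= e^{1/\lambda}.
\end{aligned}
\]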

Consider now case (iii) when $d_i = \tau$ and
$d'_i = \tau - 1$. Since $\omega \in \Omega$ we have
$d_i^* = 0$. Moreover, since ZEALOUS eliminates all
counts in $H'$ that are smaller than $\tau$, it follows
that $\Pr[{E'_i}^* \mid E'_i] = 1$. Therefore,

\[
\frac{\Pr[E_i^* \mid E_i]}{\Pr[{E'_i}^* \mid E'_i]} = \Pr[E_i^* \mid E_i] \le 1 \le e^{1/\lambda}.
\]



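In summary,
$\Pr[E_i^* \mid E_i] / \Pr[{E'_i}^* \mid E'_i] \le e^{1/\lambda}$.
Since $|\Delta| \le 2m$, we have

\[
\begin{aligned}
\frac{\Pr[Z(S) = \omega]}{\Pr[Z(S') = \omega]}
&= \prod_{i \in [1,|\Delta|]} \frac{\Pr[E_i^* \mid E_i]}{\Pr[{E'_i}^* \mid E'_i]} \\
&\le \prod_{i \in [1,|\Delta|]} e^{1/\lambda} \\
&\le e^{\epsilon} && \text{(by Equation (4) and $|\Delta| \le 2m$).}
\end{aligned}
\]

This concludes the proof of the theorem.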
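
The per-keyword bound of case (i) and its composition over at most $2m$ keywords can be sanity-checked on a grid. This is a numerical sketch only, not part of the proof: it assumes zero-mean Laplace noise with scale $\lambda$, takes $\lambda = 2m/\epsilon$ as the smallest scale permitted by Equation (4), and evaluates integer values of the counts; the function names and parameter values are illustrative.

```python
import math

def laplace_pdf(x, lam):
    """Density of zero-mean Laplace noise with scale lam."""
    return math.exp(-abs(x) / lam) / (2 * lam)

def laplace_cdf(x, lam):
    """Probability that zero-mean Laplace noise with scale lam is at most x."""
    if x < 0:
        return 0.5 * math.exp(x / lam)
    return 1 - 0.5 * math.exp(-x / lam)

def worst_ratio_case_i(lam, tau, tau_prime, max_count):
    """Largest ratio Pr[E*_i | E_i] / Pr[E'*_i | E'_i] on an integer grid."""
    worst = 0.0
    for d in range(tau, max_count + 1):
        for d_prime in (d - 1, d + 1):
            if d_prime < tau:
                continue  # case (i) needs both true counts to be at least tau
            # Keyword published with count d* > 0: ratio of noise densities.
            for d_star in range(tau_prime + 1, max_count + 1):
                ratio = laplace_pdf(d_star - d, lam) / laplace_pdf(d_star - d_prime, lam)
                worst = max(worst, ratio)
            # Keyword dropped (d* = 0): ratio of the probabilities that the
            # noisy count does not exceed tau_prime.
            ratio = laplace_cdf(tau_prime - d, lam) / laplace_cdf(tau_prime - d_prime, lam)
            worst = max(worst, ratio)
    return worst

m, eps = 3, 1.0
lam = 2 * m / eps                          # smallest scale with lam >= 2m/eps
worst = worst_ratio_case_i(lam, tau=5, tau_prime=40, max_count=120)
per_keyword_bound = math.exp(1 / lam)
total_bound = per_keyword_bound ** (2 * m)  # at most 2m keywords differ
```

On this grid the worst ratio matches $e^{1/\lambda}$ up to rounding, and raising it to the power $2m$ gives $e^{\epsilon}$, in line with the final step of the proof.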

B Proofs of the Comparison 
of Indistinguishability and 
Prob. Diff. Privacy 

B.1 Proof of Proposition 1

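Assume that, for all search logs $S$, we can divide the
output space $\Omega$ into two sets $\Omega_1, \Omega_2$,
such that

(1) $\Pr[\mathcal{A}(S) \in \Omega_2] \le \delta$, and

for all search logs $S'$ differing from $S$ only in the
search history of a single user and for all
$o \in \Omega_1$:

(2) $\Pr[\mathcal{A}(S) = o] \le e^{\epsilon} \Pr[\mathcal{A}(S') = o]$ and
$\Pr[\mathcal{A}(S') = o] \le e^{\epsilon} \Pr[\mathcal{A}(S) = o]$.

Consider any subset $O$ of the output space $\Omega$ of
$\mathcal{A}$. Let $O_1 = O \cap \Omega_1$ and
$O_2 = O \cap \Omega_2$. We have

\[
\begin{aligned}
\Pr[\mathcal{A}(S) \in O]
&= \int_{o \in O_2} \Pr[\mathcal{A}(S) = o]\, do + \int_{o \in O_1} \Pr[\mathcal{A}(S) = o]\, do \\
&\le \int_{o \in \Omega_2} \Pr[\mathcal{A}(S) = o]\, do + e^{\epsilon} \int_{o \in O_1} \Pr[\mathcal{A}(S') = o]\, do \\
&\le \delta + e^{\epsilon} \int_{o \in O_1} \Pr[\mathcal{A}(S') = o]\, do \\
&= \delta + e^{\epsilon} \cdot \Pr[\mathcal{A}(S') \in O_1] \\
&\le \delta + e^{\epsilon} \cdot \Pr[\mathcal{A}(S') \in O] && \text{(because $O_1 \subseteq O$).}
\end{aligned}
\]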

B.2 Proof of Proposition 2 

We have to show that for all search logs $S, S'$
differing in one user history and for all sets $O$:

\[
\Pr[\mathcal{A}(S) \in O] \le \Pr[\mathcal{A}(S') \in O] + 1/(|\mathcal{D}| - 1).
\]

Since Algorithm 1 neglects all but the first input, this
is true for neighboring search logs not differing in the
first user's input. We are left with the case of two
neighboring search logs $S, S'$ differing in the search
history of the first user. Let us analyze the output
distributions of Algorithm 1 under these two inputs $S$
and $S'$. For all search histories except the search
histories of the first user in $S, S'$ the output
probability is $1/(|\mathcal{D}| - 1)$ for either input.
For all search histories not in $\mathcal{D}$ the output
probability is zero. Only for the two search histories
of the first user $S_1, S'_1$ do the output
probabilities differ: Algorithm 1 never outputs $S_1$
given $S$, but it outputs this search history with
probability $1/(|\mathcal{D}| - 1)$ given $S'$.
Symmetrically, Algorithm 1 never outputs $S'_1$ given
$S'$, but it outputs this search history with
probability $1/(|\mathcal{D}| - 1)$ given $S$. Thus, we
have for all sets $O$

\[
\begin{aligned}
\Pr[\mathcal{A}(S) \in O]
&= \sum_{d \in O \cap (\mathcal{D} - S_1)} 1/(|\mathcal{D}| - 1) && (11) \\
&\le 1/(|\mathcal{D}| - 1) + \sum_{d \in O \cap (\mathcal{D} - S'_1)} 1/(|\mathcal{D}| - 1) && (12) \\
&= \Pr[\mathcal{A}(S') \in O] + 1/(|\mathcal{D}| - 1). && (13)
\end{aligned}
\]
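
The bound in Equations (11) to (13) can be verified exhaustively on a small domain. Algorithm 1 itself is not reproduced in this appendix, so the sketch below models it only through the output probabilities stated in the proof: every search history in $\mathcal{D}$ other than the first user's is output with probability $1/(|\mathcal{D}| - 1)$. The domain and history names are hypothetical.

```python
from fractions import Fraction
from itertools import chain, combinations

def output_distribution(domain, first_history):
    """Output probabilities as described in the proof: uniform over all
    histories in the domain except the first user's own history."""
    p = Fraction(1, len(domain) - 1)
    return {h: (Fraction(0) if h == first_history else p) for h in domain}

def max_gap(domain, s1, s1_prime):
    """Maximum over all output sets O of Pr[A(S) in O] - Pr[A(S') in O]."""
    dist_s = output_distribution(domain, s1)
    dist_s_prime = output_distribution(domain, s1_prime)
    subsets = chain.from_iterable(
        combinations(domain, r) for r in range(len(domain) + 1)
    )
    return max(
        sum(dist_s[h] for h in O) - sum(dist_s_prime[h] for h in O)
        for O in subsets
    )

# Hypothetical domain of five search histories; the first user's history is
# "h1" in S and "h2" in the neighboring log S'.
domain = ["h1", "h2", "h3", "h4", "h5"]
gap = max_gap(domain, "h1", "h2")
```

For this domain the largest gap is exactly 1/4, that is $1/(|\mathcal{D}| - 1)$, and it is attained by any set $O$ that contains $S'_1$ but not $S_1$.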